1. Global flood records dataset analysis

As a preliminary analysis, we visualized the relationship between flood characteristics and the number of people displaced in the Global Flood Record database. First, we visualized the distribution of each variable:
Except for severity, the variables appear skewed, so we log-transformed them and then centered and scaled them. Next, we fit a multiple linear regression with "displaced" as the dependent variable and the other three variables (severity, affected, and duration) as independent variables. This model explains ~19% of the variance in the standardized log number of people displaced.
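
The preprocessing described above (log-transform, then center and scale) can be sketched as follows. This is a minimal illustration in Python, not the Stata code used for the analysis; the function name `std_log` and the sample values are hypothetical, and it assumes strictly positive inputs, the natural logarithm, and the sample standard deviation.

```python
import math
import statistics


def std_log(values):
    """Log-transform positive values, then center and scale to unit SD."""
    logs = [math.log(v) for v in values]  # assumes every value is > 0
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in logs]


# Hypothetical, right-skewed example values (not taken from the dataset).
displaced = [50, 200, 1_000, 20_000, 500_000]
z = std_log(displaced)
```

After this step the transformed variable has mean 0 and standard deviation 1, so regression coefficients are expressed in standard-deviation units.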

     1 .  regress std_log_displaced severity std_log_affected std_log_duration
      
            Source |       SS       df       MS              Number of obs =    3034
      -------------+------------------------------           F(  3,  3030) =  239.40
             Model |  581.158345     3  193.719448           Prob > F      =  0.0000
          Residual |  2451.84164  3030  .809188658           R-squared     =  0.1916
      -------------+------------------------------           Adj R-squared =  0.1908
             Total |  3032.99998  3033  .999999994           Root MSE      =  .89955
      
      ----------------------------------------------------------------------------------
      std_log_displa~d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
              severity |   .1240879   .0421517     2.94   0.003     .0414392    .2067367
      std_log_affected |   .1801437   .0188169     9.57   0.000     .1432485     .217039
      std_log_duration |   .3070196   .0184443    16.65   0.000     .2708549    .3431842
                 _cons |  -.2192853   .0545864    -4.02   0.000    -.3263154   -.1122551
      ----------------------------------------------------------------------------------
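
As a quick check on how to read the table above, R-squared and adjusted R-squared can be recomputed from the sums of squares it reports. The sketch below is in Python purely for illustration; the numbers are copied from the Stata output and the variable names are our own.

```python
# Sums of squares and degrees of freedom copied from the regression table.
ss_model, df_model = 581.158345, 3
ss_resid, df_resid = 2451.84164, 3030
ss_total, df_total = 3032.99998, 3033

# Share of the total variation captured by the model.
r2 = ss_model / ss_total
# Adjusted R-squared compares mean squares, penalizing extra predictors.
adj_r2 = 1 - (ss_resid / df_resid) / (ss_total / df_total)

print(f"R-squared = {r2:.4f}, adjusted = {adj_r2:.4f}")
# R-squared = 0.1916, adjusted = 0.1908
```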
      
      
      
      

Next, we proceeded to check if the assumptions for linear regression hold:

2. Perform principal components analysis on "affected", "severity", and "duration"

Estimate principal components.

     3 .  pca severity std_log_affected std_log_duration
      
      Principal components/correlation                  Number of obs    =      4312
                                                        Number of comp.  =         3
                                                        Trace            =         3
          Rotation: (unrotated = principal)             Rho              =    1.0000
      
          --------------------------------------------------------------------------
             Component |   Eigenvalue   Difference         Proportion   Cumulative
          -------------+------------------------------------------------------------
                 Comp1 |      1.58649      .701889             0.5288       0.5288
                 Comp2 |      .884599      .355685             0.2949       0.8237
                 Comp3 |      .528914            .             0.1763       1.0000
          --------------------------------------------------------------------------
      
      Principal components (eigenvectors) 
      
          ----------------------------------------------------------
              Variable |    Comp1     Comp2     Comp3 | Unexplained 
          -------------+------------------------------+-------------
              severity |   0.4067    0.9125    0.0443 |           0 
          std_log_af~d |   0.6412   -0.3196    0.6977 |           0 
          std_log_du~n |   0.6508   -0.2554   -0.7150 |           0 
          ----------------------------------------------------------
      
      

Component 1 explains ~53% of the variance, so we computed the score for that component (pc1) and regressed the outcome on it alone.
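
To make the two quantities in this step concrete, the following Python sketch recomputes the proportion of variance per component from the eigenvalues and shows how a first-component score is formed as a weighted sum. The eigenvalues and loadings are copied from the PCA output; the function name `pc1_score` and the example input are hypothetical, and the sketch assumes every input (including severity) has been standardized before the loadings are applied.

```python
# Eigenvalues and first-component loadings copied from the PCA output.
eigenvalues = [1.58649, 0.884599, 0.528914]
pc1_loadings = {"severity": 0.4067, "affected": 0.6412, "duration": 0.6508}

# For a correlation-matrix PCA the eigenvalues sum to the number of variables.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]


def pc1_score(z):
    """Weighted sum of standardized variables using the Comp1 loadings."""
    return sum(pc1_loadings[name] * z[name] for name in pc1_loadings)


# Hypothetical flood one standard deviation above the mean on every variable.
example = pc1_score({"severity": 1.0, "affected": 1.0, "duration": 1.0})
```

All three loadings on the first component are positive, so a higher score corresponds to a flood that is more severe, larger, and longer at the same time.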

As shown below, the R2 of this model is similar to that of the earlier model with separate terms for "severity", "affected", and "duration" (R2 ~0.18 versus ~0.19).

     4 .  regress std_log_displaced pc1
      
            Source |       SS       df       MS              Number of obs =    3034
      -------------+------------------------------           F(  1,  3032) =  660.36
             Model |  542.436732     1  542.436732           Prob > F      =  0.0000
          Residual |  2490.56325  3032  .821425874           R-squared     =  0.1788
      -------------+------------------------------           Adj R-squared =  0.1786
             Total |  3032.99998  3033  .999999994           Root MSE      =  .90633
      
      ------------------------------------------------------------------------------
      std_log_di~d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               pc1 |   .3320503   .0129215    25.70   0.000     .3067145    .3573861
             _cons |  -.0640488   .0166419    -3.85   0.000    -.0966794   -.0314183
      ------------------------------------------------------------------------------
      
      
      

The regression model with "pc1" has a higher R2 (~0.18) than the regression models with "severity" alone (~0.02), "affected" alone (~0.11), or "duration" alone (~0.16).

     5 .  regress std_log_displaced severity
      
            Source |       SS       df       MS              Number of obs =    3034
      -------------+------------------------------           F(  1,  3032) =   67.93
             Model |  66.4655162     1  66.4655162           Prob > F      =  0.0000
          Residual |  2966.53446  3032  .978408464           R-squared     =  0.0219
      -------------+------------------------------           Adj R-squared =  0.0216
             Total |  3032.99998  3033  .999999994           Root MSE      =  .98915
      
      ------------------------------------------------------------------------------
      std_log_di~d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          severity |   .3714916   .0450724     8.24   0.000      .283116    .4598672
             _cons |  -.4636299   .0590483    -7.85   0.000    -.5794086   -.3478511
      ------------------------------------------------------------------------------
      
      
     6 . regress std_log_displaced std_log_affected
      
            Source |       SS       df       MS              Number of obs =    3034
      -------------+------------------------------           F(  1,  3032) =  374.66
             Model |  333.563804     1  333.563804           Prob > F      =  0.0000
          Residual |  2699.43618  3032  .890315361           R-squared     =  0.1100
      -------------+------------------------------           Adj R-squared =  0.1097
             Total |  3032.99998  3033  .999999994           Root MSE      =  .94357
      
      ----------------------------------------------------------------------------------
      std_log_displa~d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
      std_log_affected |   .3362418   .0173714    19.36   0.000     .3021809    .3703027
                 _cons |  -.0289544   .0171955    -1.68   0.092    -.0626704    .0047615
      ----------------------------------------------------------------------------------
      
      
     7 . regress std_log_displaced std_log_duration
      
            Source |       SS       df       MS              Number of obs =    3034
      -------------+------------------------------           F(  1,  3032) =  590.56
             Model |  494.448928     1  494.448928           Prob > F      =  0.0000
          Residual |  2538.55105  3032  .837252986           R-squared     =  0.1630
      -------------+------------------------------           Adj R-squared =  0.1627
             Total |  3032.99998  3033  .999999994           Root MSE      =  .91502
      
      ----------------------------------------------------------------------------------
      std_log_displa~d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
      std_log_duration |   .3992275   .0164281    24.30   0.000     .3670161    .4314389
                 _cons |   -.063597   .0168168    -3.78   0.000    -.0965705   -.0306235
      ----------------------------------------------------------------------------------
      
      
      
      

Check assumptions of linear regression for model with principal component alone:

Finally, we visually assessed the relationship between multiple flood characteristics and the number of people displaced. Such a visualization would be difficult to perform with traditional linear regression in the presence of multiple independent variables.

Principal components analysis reduced the three predictors to a single dimension, which allowed direct visualization of predictor against outcome. Regression on the first principal component was at least as predictive as regression on any single traditional independent variable. However, the disadvantage of this approach is that the first principal component is harder to interpret conceptually than a traditional predictor such as severity.

For this section, we mainly look at the Dartmouth Flood Observatory dataset (GlobalFloodsRecord.xls), which documents flood events in all parts of the world from 1985 until early this year, and summarize the data grouped by country. We pick five variables: duration of the flood in days, number of deaths, magnitude of the flood, severity level, and affected area in square kilometers. We then sum each variable within each country to get cumulative flood information per country.

The country name variable in the dataset is messy: it contains NA entries as well as misspelled country names. We clean up these entries and group by country using the “dplyr” package.
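
The cleaning and per-country summation could look roughly like the sketch below. It is written in Python for illustration only (the analysis itself used dplyr in R); the field names, the `CORRECTIONS` table, and the sample rows are all hypothetical stand-ins, not values from the dataset.

```python
from collections import defaultdict

# Hypothetical spelling corrections; the real list depends on the data.
CORRECTIONS = {"Philipines": "Philippines", "Phillipines": "Philippines"}


def clean_country(name):
    """Return a normalized country name, or None for missing entries."""
    if name is None or not name.strip():
        return None
    name = name.strip()
    return CORRECTIONS.get(name, name)


def totals_by_country(records, fields):
    """Sum the given numeric fields over all floods in each country."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for rec in records:
        country = clean_country(rec.get("country"))
        if country is None:
            continue  # drop rows without a usable country name
        for field in fields:
            totals[country][field] += rec[field]
    return dict(totals)


# Made-up rows for illustration only.
floods = [
    {"country": "Philippines", "duration": 4, "dead": 10},
    {"country": "Philipines ", "duration": 6, "dead": 3},
    {"country": None, "duration": 2, "dead": 1},
]
summary = totals_by_country(floods, ["duration", "dead"])
```

Here the misspelled entry is merged into the correct country and the row with a missing name is dropped, which is one possible policy; keeping such rows under an "unknown" label would be an equally reasonable choice.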

We now have a dataset in which each row is a country and each column is a feature (total duration in that country, cumulative flood severity, etc.). We obtain the geolocation of each country as longitude and latitude using the geocode() function from the “ggmap” package and append the coordinates to each row.

The first plot is as follows:

This plot displays each country as a circle on the world map. The size of the circle marks the total number of deaths during the floods that happened in that country (bigger means more deaths), and the color indicates the total duration of all floods in days for that country (darker means the floods lasted longer). The darker (i.e., more red) circles are clearly bigger, which suggests the intuitive relationship that the longer the floods lasted, the more people died.

The second plot explores the relationship between the cumulative severity of all floods and the total affected area in square kilometers:

As we would expect intuitively, the darker (more red) circles are visibly bigger, which shows a positive relationship: the larger the area affected by floods in a country, the more severe those floods were.

The third plot helps us verify our intuition that the longer the floods lasted in a country, the more severe those floods were:

The next plot investigates the relationship between the cumulative severity of the floods in a country and the total number of deaths during those floods:

It turns out that these two features are also positively related.

Lastly, we ask whether the same kind of relationship exists between the cumulative magnitude of floods and the number of deaths during floods in a country. Magnitude is calculated by the formula \(\text{magnitude}=\log(\text{duration}\cdot \text{severity}\cdot \text{total affected area})\), and our plots show pairwise positive relationships among the three variables used to calculate the magnitude, as well as with the number of deaths. We therefore expect a clear relationship between magnitude and the number of deaths. We confirm this with the following plot:
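
The magnitude formula can be written out as a small Python sketch. The function name and the example inputs are hypothetical, and the base-10 logarithm is an assumption, since the formula only says "log". The second computation illustrates why magnitude tracks all three inputs: the log of a product is the sum of the logs.

```python
import math


def flood_magnitude(duration_days, severity, area_km2):
    """Magnitude as the log of duration x severity x affected area."""
    # Base 10 is an assumption; the text only writes "log".
    return math.log10(duration_days * severity * area_km2)


# The log of a product is the sum of the logs, so each factor
# contributes additively to the magnitude.
m = flood_magnitude(10, 1, 1_000)
parts = math.log10(10) + math.log10(1) + math.log10(1_000)
```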

As a second part, the Global Flood Record dataset was collapsed into a per-continent view to summarize the most relevant predictors by continent:

It can be seen that Asia is by far the continent most affected by floods, not only in frequency but also in intensity, severity, death count, magnitude, duration, and affected area. On the other hand, Australia is by far the least affected continent on all measures.

5. Principal component analysis on the floods in the UK during 1990

We decided to subset the data and focus on five flood events that occurred in the United Kingdom during 1990. We identified these events by slicing the Global Floods data over time and generating an animation that allowed us to pick out large, isolated flood events. The interactive animation is shown below:

Transforming the data

Once the entries for the UK floods in 1990 had been located in the Global Flood Record, the next step was to locate the corresponding information within the NOAA daily phi database. To do this, the Global Flood Record dates were transformed into days since January 1st, 1948, so that the two tables could be correctly matched. The transformed values look like this:

##      Began Ended
## 3837  7663  7666
## 3845  7605  7607
## 3856  7583  7584
## 3924  7361  7363
## 3930  7330  7345
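
The date transformation can be sketched as follows. This is an illustrative Python version, not the R code used for the analysis; the function names are hypothetical, and treating January 1st, 1948 as day 0 (rather than day 1) is an assumption.

```python
from datetime import date, timedelta

ORIGIN = date(1948, 1, 1)  # assumed to be day 0 of the day-count scale


def to_day_count(d):
    """Number of days elapsed between ORIGIN and the date d."""
    return (d - ORIGIN).days


def from_day_count(n):
    """Calendar date corresponding to a day count n."""
    return ORIGIN + timedelta(days=n)


# 1948 is a leap year, so 1 January 1949 falls 366 days after the origin.
first_of_1949 = to_day_count(date(1949, 1, 1))
round_trip = from_day_count(first_of_1949)
```

Keeping both directions of the conversion makes it easy to verify that a matched row in the second table corresponds to the intended calendar date.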

Once the phi data for the days of each flood event was located, we used the following protocol to build five data matrices, one per flood:

Each dataset consists of a ribbon subset of the phi data, ranging over all longitudes (in order to preserve the direction-shifting pattern of pressure waves) but only over latitudes from 45N to 60N. This constitutes a grid of 7x144=1008 cells for every day. In order to keep our datasets as matrices, we used the as.vector() function to collapse the 1008 grid cells for each day into a single vector.
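
The flattening step can be illustrated with the Python sketch below, which mimics the column-by-column order in which R's as.vector() unrolls a matrix. The function name and the placeholder grid are hypothetical, and storing the grid as 7 latitude rows by 144 longitude columns is an assumption about the layout.

```python
N_LAT, N_LON = 7, 144  # grid dimensions stated in the text


def flatten_column_major(grid):
    """Flatten a row-list grid column by column, as R's as.vector() does."""
    n_rows, n_cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(n_cols) for r in range(n_rows)]


# Placeholder grid: each cell holds its own (lat_index, lon_index).
one_day = [[(i, j) for j in range(N_LON)] for i in range(N_LAT)]
row = flatten_column_major(one_day)
```

Each day then becomes one row of 1008 values, and stacking the rows for all selected days gives the data matrix for a flood event.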

Since princomp() in R requires more observations than variables, and we also wanted to capture the varying effect of pressure levels over time in several time dimensions, we decided to subset observations in the following way:

  1. For every flood event, we designed an interval() function that, given the start and end dates of a specific flood event, returns a date interval ranging from 30 days before the start of the flood to 30 days after its end. That is, for a 4-day flood it returns a 64-day interval, and for a 16-day flood it returns a 76-day interval.
  2. Every interval is analyzed over all years ranging from 19 years before the year of the flood to 19 years after it, that is, 39 years including the flood year. Combining the two steps, a 4-day flood event gives a 64-day period times 39 years = 2496 observations. Similarly, a 16-day flood gives a 76-day period times 39 years = 2964 observations.

Hence, the 5 matrices have between 2418 and 2964 rows, and all of them have 1008 columns (one per grid cell), as detailed below:

## [1] 2496 1008
## [1] 2457 1008
## [1] 2418 1008
## [1] 2457 1008
## [1] 2964 1008
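
The row counts above follow directly from the windowing protocol and can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only; the Began and Ended values are copied from the table of transformed dates, while the function and constant names are our own.

```python
PAD_DAYS = 30         # days added before the start and after the end
YEARS_EACH_SIDE = 19  # years taken before and after the flood year

# (Began, Ended) day counts copied from the table of transformed dates.
events = [(7663, 7666), (7605, 7607), (7583, 7584), (7361, 7363), (7330, 7345)]


def n_rows(began, ended):
    """Rows contributed by one flood: window length times number of years."""
    flood_days = ended - began + 1  # inclusive of first and last day
    window = PAD_DAYS + flood_days + PAD_DAYS
    n_years = 2 * YEARS_EACH_SIDE + 1
    return window * n_years


rows = [n_rows(b, e) for b, e in events]
print(rows)  # [2496, 2457, 2418, 2457, 2964]
```

Every count exceeds the 1008 columns of the matrix, so each data matrix has more rows than columns.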

Once we had the slices for each flood event, we ran PCA on them. The plots below show the variance explained by each component:

It can be seen that floods #1, #4 and #5 have at least 6 significant principal components, whereas floods #2 and #3 have only 2 very significant components. All of them, however, show a considerable difference in explained variance between the first component and all other ones.